Identification of A Gene Set Associated with Colorectal Cancer in Microarray Data Using The Entropy Method

Objective: We sought to apply Shannon's entropy to identify colorectal cancer genes in a microarray dataset.
Materials and Methods: In this retrospective study, 36 samples were analysed: 18 colorectal carcinoma samples and 18 paired normal tissue samples. After determining the gene fold-changes, we used entropy theory to identify an effective gene set. These genes were subsequently categorised into homogeneous clusters.
Results: We assessed 36 tissue samples. Entropy theory was used to select, from 3128 genes with fold-changes greater than one, a set of 29 genes that provided the most information on colorectal cancer. All of the selected genes fell into a single cluster, except for the R08183 gene.
Conclusion: Using the entropy method, this study identified several genes associated with colon cancer that were not detected by conventional methods. Therefore, we suggest that entropy theory should be used to identify genes associated with cancers in microarray datasets.


Introduction
Cancer is one of the leading causes of death in both developed and developing countries. Increasing life expectancy will cause a worldwide increase in the cancer burden, especially in less developed countries (1). In 2012, 14.1 million new cancers were detected, with an estimated 8.2 million deaths from cancer worldwide (2).
Colorectal cancer is one of the most common types of cancer. Despite progress in screening and diagnostic methods, it is the third most common cancer in the world, and it ranks fourth and fifth among cancers in developed and less developed countries, respectively. Worldwide, colon cancer is the third most frequent cancer in males and the second most frequent in females. Colorectal cancer comprises 10% of all malignancies in men and 9.2% of all cancers in women. Approximately 55% of people with colorectal cancer live in developed countries (2).
In European countries, colorectal cancer has the highest incidence among malignancies and is the second leading cause of death among malignant diseases (3). The latest studies show that the annual rate of colorectal cancer is increasing worldwide (2,4). However, over 95% of colorectal cancers can be treated if detected early (5).
Therefore, early diagnosis of colorectal cancer and determination of its prognosis are very important.
One of the prognostic factors for colorectal cancer is the gene set associated with this disease. Gene expression information extracted by microarray technology can be used to determine this gene set. Microarray data are currently used to determine disease prognosis and to classify genes associated with cancers.
One of the challenges in the analysis of microarray data is selecting the genes associated with cancer, because the number of genes examined is large compared to the number of cases. This may lead to bias in gene selection and classification (6). To address this problem, advanced mathematical methods can be used to reduce the number of genes (7).
Shannon's entropy is one of the techniques for reducing the dimension of a large dataset, such as microarray data, that has recently attracted the attention of researchers. Entropy is used to classify genes into categories according to their similarities and dissimilarities. The gene selection algorithm is then applied again to refine the selected gene list so that at least one subset gives the desired classification accuracy. The present study used Shannon's entropy method to determine the up-regulated (overexpressed) genes associated with colorectal cancer. These genes could potentially be used as new therapeutic targets.

Dataset
In this retrospective study, we used data from the study by Notterman et al. (8).

Shannon's entropy
The information from the text file provided by Notterman et al. (8) was exported to SPSS version 16.0 for Windows (SPSS Inc., Chicago, IL). We then calculated the average gene expression separately for the tumour and normal tissues. Next, we determined the fold-change of each gene according to equation 1,
where ave(C) and ave(N) denote the average expression intensity levels of the tumour and normal tissues, respectively.
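Equation 1 is not reproduced in this text. As a minimal sketch, assuming that the fold-change is the simple ratio ave(C)/ave(N) (an assumption that is consistent with the later use of "fold-change greater than one" to mean higher expression in tumour tissue), the calculation could look as follows. The gene names and intensity values are hypothetical and serve only as an illustration.

```python
from statistics import mean

def fold_change(tumour, normal):
    """Ratio of mean tumour to mean normal expression (assumed form of equation 1)."""
    return mean(tumour) / mean(normal)

# Hypothetical expression intensities, not taken from the Notterman et al. dataset.
expression = {
    "gene_A": {"tumour": [200.0, 240.0, 220.0], "normal": [20.0, 22.0, 24.0]},
    "gene_B": {"tumour": [50.0, 55.0, 45.0], "normal": [100.0, 110.0, 90.0]},
}

# Fold-change per gene, then keep only genes with fold-change greater than one.
fold_changes = {g: fold_change(v["tumour"], v["normal"]) for g, v in expression.items()}
up_regulated = [g for g, fc in fold_changes.items() if fc > 1]
```

With these illustrative values, only gene_A passes the filter, because its mean tumour intensity exceeds its mean normal intensity.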
We used Shannon's entropy theory (equation 2) to select the gene set that carried the most mathematical information about colorectal cancer (9). In addition, the uncertainty of genes was measured by equation 3, which defines the interdependency of two genes, X and Y. In equation 3, H(X,Y), H(X), and H(Y) are the mutual information, the entropy of gene X, and the entropy of gene Y, respectively. The normalized mutual information between the two genes, U(X,Y), was then defined. Values of one and zero for U(X,Y) denote that genes X and Y have high mutual relevance (i.e., they are dependent) and low mutual relevance (i.e., they are independent), respectively.
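The equations referred to in this passage are not reproduced in this text. The sketch below illustrates the quantities involved using standard textbook definitions, which are assumptions here rather than the exact forms used in the study: Shannon entropy H(X) = -Σ p(x) log2 p(x), mutual information computed as H(X) + H(Y) minus the joint entropy, and a normalized mutual information of twice the mutual information divided by H(X) + H(Y), which ranges from zero to one. The sketch also assumes that expression levels have already been discretised into a small number of levels; the function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def mutual_information(x, y):
    """Mutual information from the marginal and joint entropies."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def normalized_mi(x, y):
    """Assumed normalization: 2 * I(X;Y) / (H(X) + H(Y)), between 0 and 1."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant, so there is nothing to share
    return 2 * mutual_information(x, y) / (hx + hy)

# Hypothetical discretised expression levels across four samples.
gene_x = [0, 0, 1, 1]
gene_y = [0, 0, 1, 1]  # identical pattern to gene_x (fully dependent)
gene_z = [0, 1, 0, 1]  # pattern unrelated to gene_x (independent)

u_dependent = normalized_mi(gene_x, gene_y)
u_independent = normalized_mi(gene_x, gene_z)
```

In this toy example the identical pair gives a value of one and the unrelated pair gives zero, matching the interpretation of U(X,Y) described above.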
If S is a collection of selected genes, the degree of suitability (relevance) and the degree of complementarity of the genes are determined by equations 5 and 6, respectively,
where g_i represents the i-th gene and the remaining symbol represents the corresponding cluster. Gene set S must be selected such that the gene relevance rate is maximized (equation 5), while the gene excess (redundancy) rate is minimized (equation 6).
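Equations 5 and 6 are not reproduced in this text, so the following is only a sketch of the general idea of maximising relevance while minimising redundancy, not the procedure used in the study. It assumes a greedy selection in which each candidate gene is scored by its normalized mutual information with a class label minus its average normalized mutual information with the genes already selected. The scoring rule, the function names, the `labels` vector and the gene names are all hypothetical.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def nmi(x, y):
    """Assumed normalized mutual information: 2 * I(X;Y) / (H(X) + H(Y))."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    mutual = hx + hy - entropy(list(zip(x, y)))
    return 2 * mutual / (hx + hy)

def greedy_select(genes, labels, k):
    """Pick k genes, trading relevance to the labels against redundancy."""
    selected = []
    candidates = sorted(genes)  # sorted so that ties are broken reproducibly
    while candidates and len(selected) < k:
        def score(g):
            relevance = nmi(genes[g], labels)
            if not selected:
                return relevance
            redundancy = sum(nmi(genes[g], genes[s]) for s in selected) / len(selected)
            return relevance - redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical discretised data: 8 samples, label 0 = normal, 1 = tumour.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
genes = {
    "g1": [0, 0, 1, 1, 2, 2, 3, 3],
    "g2": [0, 0, 1, 1, 2, 2, 3, 3],  # exact duplicate of g1 (fully redundant)
    "g3": [0, 1, 0, 1, 2, 3, 2, 3],  # equally relevant, less redundant with g1
}
chosen = greedy_select(genes, labels, k=2)
```

In this toy example all three genes are equally relevant to the labels, but once g1 is selected the duplicate g2 is penalised for redundancy, so g3 is chosen second.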

Data clustering
After using the entropy technique to select the set (S) of genes that carried the most information on colorectal cancer, we applied a two-way hierarchical clustering method to categorise them into clusters. First, we represented each gene as a vector; second, we used the Euclidean distance to determine the distance between genes (10). We used MATLAB (version 8) to determine the set S of selected genes according to the degrees of suitability and complementarity given in equations 5 and 6. In addition, the EntropyExplorer and Heatmap packages of R 3.2.2 were used to compute the entropy information of the genes and to draw the dendrograms.
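The clustering step can be illustrated with a small sketch of agglomerative hierarchical clustering on Euclidean distances between gene expression vectors. The linkage criterion is not stated in the text, so average linkage is assumed here; the function names, gene names and values are hypothetical, and this is not the MATLAB or R code used in the study.

```python
from math import dist  # Euclidean distance, available from Python 3.8

def average_linkage(a, b, profiles):
    """Mean Euclidean distance between all pairs of genes drawn from clusters a and b."""
    return sum(dist(profiles[i], profiles[j]) for i in a for j in b) / (len(a) * len(b))

def agglomerate(profiles, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[name] for name in profiles]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], profiles)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical expression vectors (one value per tissue sample).
profiles = {
    "gene_A": [1.0, 1.2, 0.9],
    "gene_B": [1.1, 1.0, 1.0],
    "gene_C": [9.0, 8.5, 9.2],
}
clusters = agglomerate(profiles, n_clusters=2)
```

For a two-way analysis, the same procedure can be applied a second time to the transposed matrix, so that tissue samples are clustered as well as genes.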

Results
We assessed 36 tissue samples: 18 adenocarcinoma tissues and 18 paired normal samples. In the initial assessment of the 7465 cDNAs and expressed sequence tags (ESTs) available (http://genomics-pubs.princeton.edu/oncology), we identified 3128 genes that had a fold-change greater than one. In the second stage, we applied entropy theory and selected a set of 29 genes that carried the most mathematical information on colorectal cancer. Table 1 lists these genes. Table 2 shows the gene names, aliases and locations, together with a comparison between the results of this study and those of previous studies.
The dendrograms of the clustered genes are shown in Figure 1. Cluster analysis of the expression intensity of the 3128 genes indicates that they can be divided into 3 clusters. In clusters 1 and 3 (Fig.1), expression intensity is clearly greater in normal tissue than in tumour tissue (a darker colour indicates greater expression), whereas in cluster 2 the expression intensity of normal samples appears to be lower than that of tumour samples.
Therefore, in the final analysis we considered only genes with greater expression in tumour tissue than in normal tissue, as shown by a fold-change greater than 1.
Although the primary cluster analysis divided the genes into 3 clusters, we did not discover any regular pattern.

Entropy analysis
After we selected 29 genes associated with colorectal cancer according to entropy theory (Table 1), we clustered them by gene expression intensity level in two directions, gene and tissue. The vertical axis of Figure 2 shows that all genes fall into a single cluster, except for the R08183 gene.

Fig.2: Cluster map derived from two-way cluster analysis with the hierarchical method. We combined the 29 common genes in tumour and normal tissues in a matrix, and clustering was performed on this matrix. Each colour patch on the cluster map indicates the expression intensity level of the associated gene in the corresponding tumour or normal tissue sample. The colour patches represent a continuum of expression levels from yellow (highest) to red (lowest).

Discussion
The present study applied entropy theory to identify and select the most important gene set associated with colorectal cancer in a large dataset, such as a microarray dataset. We also used a two-way hierarchical clustering approach to cluster the genes.
Unlike conventional methods, the method used in our work considers the correlation between genes and uses the normalized mutual information (i.e., the relevance and redundancy between genes). With this technique, the number of genes that carry information on colorectal cancer increases and the number of unrelated genes (i.e., genes that carry little information about cancer) decreases. Many gene clustering studies do not use the correlation between genes; hence, their results may not be valid. In this study, we took the correlation between genes into consideration during the selection process. In the current study, very few transcripts with a fold-change higher than 2 agreed with the results reported by Notterman et al. (8). In the analysis of microarray data, both up-regulated and down-regulated genes are important; however, we only assessed up-regulated genes in this study.
Our study found 29 genes associated with colorectal cancer, which is more than the number attributed to colorectal cancer in the Notterman et al. (8) study. This is because we used entropy theory, whereas the previous study did not use the normalized mutual information between genes. A comparison of our results with those reported by Notterman et al. (8) showed that both studies identified the same 12 genes associated with colorectal cancer. However, the current study identified 17 genes associated with colorectal cancer that were not identified in the Notterman et al. study. Their study confirmed 6 genes (KIAA0101, GRO-g, L-iditol-2 dehydrogenase, RNA POL II subunit, myoblast cell surface antigen 24.1 DS, and GTF3A) associated with colorectal cancer that we did not identify. These genes do not have a large fold-change, whereas the genes identified in the current study did. Of the 17 genes we discovered, 14 (82.35%) had a fold-change over 20. This suggests that the method used in this study discovered such genes more effectively than the methods used in other studies.
In this study, we used cluster analysis to categorise the 29 genes into 2 clusters. The first cluster included 28 genes and the second cluster contained only the R08183 gene. Liu et al. (7) did not observe this finding in their study. They detected 9 genes in the reduced final feature set, which was far fewer than the number of genes identified in the current study. Our dendrogram showed that the difference in R08183 expression between cancer tissue and normal tissue was much greater than for the other genes. This finding was not confirmed by Liu et al. (7). We included only the up-regulated genes in the model, whereas they included both up-regulated and down-regulated genes.
Our study showed that 3 genes (R37640, M94363, and L13616) had fold-changes greater than 100, whereas Liu et al. (7) did not refer to any of these genes as colorectal-related genes. Saucier and Rivard (11) and Pabla et al. (12) also showed an association of the R37640 gene with colorectal cancer. Like our study, Brackenridge et al. (13) reported a significant association between the M94363 gene and colorectal cancer. An association between the L13616 gene and colorectal cancer was confirmed by Golubovskaya et al. (14), Lark et al. (15), and the current study. However, Liu et al. (7) and Notterman et al. (8) did not observe this association. Given these differing results, we suggest that future studies focus on these genes.
Recently, a kernel function-based method has been introduced for the selection and clustering of genes. We did not use this method because of the difficulty of choosing the kernel function (16).

Conclusion
Using the entropy method, this study identified several genes associated with colon cancer that have not been detected by conventional methods. Therefore, we propose that researchers use the entropy method to identify genes associated with cancers in microarray datasets.